ABSTRACT

The Automatic Interaction Detector (AID) is discussed
as to its usefulness in multiple regression analysis. The algorithm
of AID-4 is a reversal of the model building process; it starts with
the ultimate restricted model, namely, the whole group as a unit. By
a unique splitting process maximizing the between sum of squares for
the categories of each variable while minimizing the error sum of
squares (within group sum of squares), AID-4 seeks out that variable
which has the largest between sum of squares and splits the original
group into two mutually exclusive groups on this variable at that
category where the maximum between sum of squares occurred. The major
advantage of using AID-4 is that the maximum squared composite
correlation is obtained without the task of attempting to identify
the various relevant combinations of linear and non-linear
interaction terms by trial and error necessary in the full model of
the multiple regression technique. (Author/DB)

Multiple regression analysis is a powerful approach to the formulation
and the analysis of research problems, and the testing of hypotheses.
It is less restrictive than multiple correlational analysis; e.g., multiple 
regression analysis does not assume that the predictor variables constitute 
a multivariate normal distribution. The absence of this restriction permits 
the introduction of categorical predictor variables. One use for such 
variables is the establishment of mutually exclusive groups and the testing 
of the hypothesis that knowledge of group membership at different levels 
of a predictor variable improves the accuracy of prediction of a criterion
of interest. The automatic interaction detector improves the power and 
efficiency of the application of multiple regression analysis through 
the identification of optimal configurations of predictor variables for 
criterion prediction. Joint familiarity with regression techniques and 
the application of the automatic interaction detector will provide the 
research scientist with an effective tool. Without the automatic interaction
detector, the establishment of optimally effective sets of predictor
variables is essentially a cut-and-try, guesswork process. With automatic
interaction detection, guidance is offered directly as to the optimal
prediction possible with the predictor set, and as to the identification of
reduced subsets of predictors which most closely approximate the total
validity of the full set of predictors. In this sense, AID-4 is a model
identifying process.

The multiple regression technique as illustrated by Bottenberg and
Ward (1963) starts with a K-category full regression model including all
the predictor variables (categorical and/or continuous). The basic
procedure consists of testing for the significance of the difference
between the error sum of squares of this full model and the error sum of
squares that results when some of the least-squares weighted categorical
memberships are not taken into account in the (K-n)-category restricted
model, where n is the number of restrictions imposed upon the full model.
The test of significance is done by the F-statistic, comparing the
minimized error sum of squares of the full model with that of the
restricted model. This comparison indicates the extent to which the
eliminated n categorical memberships contributed to the accuracy of
predicting the criterion variable.
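
In general form, the full-versus-restricted comparison can be written as the ratio below. The symbols are labels introduced here for illustration and are not the source's notation: $q_f$ and $q_r$ are the minimized error sums of squares of the full and restricted models, $k_f$ and $k_r$ are the numbers of linearly independent weights in each model (so that $k_f - k_r = n$, the number of restrictions), and $N$ is the number of observations.

```latex
F = \frac{(q_r - q_f)\,/\,(k_f - k_r)}{q_f\,/\,(N - k_f)},
\qquad
df = \bigl(k_f - k_r,\; N - k_f\bigr)
```

A large F indicates that the restricted model loses a substantial amount of predictive accuracy relative to the full model.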

For a simple example, let us suppose that we have two predictor
variables: $x^{(1)}$ with three levels, i.e., high school degree, undergraduate
degree and graduate degree; and $x^{(2)}$ with two levels, i.e., pilot or
navigator. The criterion variable is some test score on a 50-item test
and we have 60 individuals in the experiment. (The actual data were
taken from an example in Hays' Statistics, Holt, Rinehart and Winston,
1963, p. 403.) The simple two predictor, one criterion multiple
regression model is:

Model 1     $y = a_0 u + a_1 x^{(1)} + a_2 x^{(2)} + e_1$

which after the conventional multiple regression solution yields an
$R^2 = .7508$ and a minimized error sum of squares of $q_1 = 1607.4670$.



Testing for interaction, one would include a product term in the
model:

Model 2     $y = b_0 u + b_1 x^{(1)} + b_2 x^{(2)} + b_3 (x^{(1)} \cdot x^{(2)}) + e_2$

Model 2 is the so-called "full model" and Model 1 is the "restricted model."
It is restricted because we impose the restriction $b_3 = 0$ upon Model 2,
thus obtaining Model 1. By comparing the minimized error sums of squares
of Model 1 and Model 2, $q_1$ and $q_2$ respectively, one gets an indication
of the contribution of the product term (or "interaction") to the
predictive efficiency of the system. The solution of Model 2 gives an
$R^2 = .8184$ and $q_2 = 1171.8683$.



The F-statistic is computed by:

$$F = \frac{(q_1 - q_2)/(4 - 3)}{q_2/(60 - 4)} = 20.82$$

with df = 1 and 56. We can make further "guesses" about the predictor
variables. Let us assume that predictor $x^{(1)}$ has a quadratic component
and that the previously hypothesized interaction is also present. Our
model will look like:

Model 3     $y = c_0 u + c_1 x^{(1)} + c_2 x^{(2)} + c_3 (x^{(1)} \cdot x^{(2)}) + c_4 [x^{(1)}]^2 + e_3$

The solution of Model 3 yields an $R^2 = .8423$ and a minimized error sum
of squares $q_3 = 1017.7627$. The F-statistic is:

$$F = \frac{(q_2 - q_3)/(5 - 4)}{q_3/(60 - 5)} = 8.33$$
with df = 1 and 55. Additional possible models are listed below:

Model 4     $y = d_0 u + d_1 x^{(1)} + \ldots + d_3 x^{(1)} x^{(2)} + e_4$
            $R^2 = .8184$,  $q_4 = 1171.8683$

Model 5     $y = k_0 u + k_1 x^{(1)} + k_2 x^{(2)} + k_3 x^{(1)} x^{(2)} + k_4 [x^{(1)}]^2 + k_5 [x^{(2)}]^2 + e_5$
            $R^2 = .8423$,  $q_5 = 1017.7627$
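
The two F ratios reported for the worked example can be re-derived from the error sums of squares quoted in the text. The following is a minimal Python sketch; the function name `f_statistic` and its parameter names are hypothetical, and the numbers of weights (3, 4 and 5) are read off the denominators of the F formulas in the text.

```python
def f_statistic(q_restricted, q_full, k_restricted, k_full, n_obs):
    """F ratio comparing a restricted model against a full model.

    q_* are minimized error sums of squares; k_* are the numbers of
    weights in each model; n_obs is the number of observations.
    """
    df1 = k_full - k_restricted          # number of restrictions
    df2 = n_obs - k_full                 # error degrees of freedom
    f = ((q_restricted - q_full) / df1) / (q_full / df2)
    return f, df1, df2


# Error sums of squares for Models 1, 2 and 3 as reported in the text.
Q1, Q2, Q3 = 1607.4670, 1171.8683, 1017.7627
N_OBS = 60

# Model 1 (restricted) against Model 2 (full): tests the interaction term.
f_12, df1_12, df2_12 = f_statistic(Q1, Q2, 3, 4, N_OBS)

# Model 2 (restricted) against Model 3 (full): tests the quadratic term.
f_23, df1_23, df2_23 = f_statistic(Q2, Q3, 4, 5, N_OBS)

print(round(f_12, 2), df1_12, df2_12)
print(round(f_23, 2), df1_23, df2_23)
```

Rounded to two decimals, the sketch reproduces the values 20.82 (df = 1 and 56) and 8.33 (df = 1 and 55) given above.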



It should be obvious at this point that had we had a more complex problem,
for example 40 predictor variables with 10 levels each, the guesswork
would have been futile and totally unreasonable. The number of possible
mutually exclusive categories in the model would be $10^{40}$, most of which
would be empty, considering that the total population of the earth is
approximately $4 \times 10^9$.
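
The size of this gap is easy to check with exact integer arithmetic. The short Python sketch below uses only the figures quoted in the text (40 predictors, 10 levels each, a population of roughly $4 \times 10^9$); the variable names are illustrative.

```python
from fractions import Fraction

n_predictors = 40
levels_per_predictor = 10

# Every combination of levels defines one mutually exclusive category (cell).
cells = levels_per_predictor ** n_predictors

# Even if every person on earth fell into a different cell, at most this
# fraction of the cells could be non-empty.
population = 4 * 10**9
max_occupied_fraction = Fraction(population, cells)

print(cells)
print(float(max_occupied_fraction))
```

Even under the most favorable assumption, all but a vanishing fraction of the cells must be empty.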

This was the reason for implementing and developing AFHRL's version
of AID-4. The algorithm of AID-4 is a reversal of the model building
process. Rather than starting with a full model, including all possible
predictors and their simple and complex interactions, AID-4 starts with
the ultimate restricted model, namely, the whole group as a unit. By a
unique splitting process maximizing the between sum of squares (BSS) for
the categories of each variable while minimizing the error sum of squares
(within group sum of squares), AID-4 seeks out that variable which has the
largest BSS and splits the original group into two mutually exclusive
groups on this variable at that category where the maximum BSS occurred.
For example, given an 80 variable problem with 10 categories per variable,
if the maximum BSS was found in Variable 9 and between categories 1, 2,
3 and 4, 5, 6, 7, 8, 9, 10, the original Group 1 will be split into two
mutually exclusive groups: (a) Group 2 consisting of those individuals
whose response to Variable 9 was 1 or 2 or 3, and (b) Group 3 consisting
of the remainder of the individuals whose response to Variable 9 was 4, 5,
6, 7, 8, 9 or 10. In actuality, AID-4 has identified the first level full
model consisting of 2 groups. The test of significance is an F-test
comparing the minimized error sum of squares of the full model (2 groups)
and the restricted model (original 1 group). The test of significance
for the first split is equivalent to an F-test obtained by a one-way
analysis of variance comparing the 2 groups on the criterion variable.
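
The search for a single split can be sketched as follows. This is a minimal Python illustration, not the AFHRL program: it assumes integer category codes with a meaningful order and considers only cuts that keep adjacent categories together (as in the Variable 9 example). The function names and the toy data are hypothetical.

```python
def best_binary_split(categories, scores):
    """Return (cut, bss) for one predictor.

    Individuals with category <= cut form one group, the rest the other.
    bss is the between sum of squares of that two-group division.
    """
    n = len(scores)
    total = sum(scores)
    levels = sorted(set(categories))
    best_cut, best_bss = None, 0.0
    for cut in levels[:-1]:              # the last level cannot be a cut point
        left = [y for c, y in zip(categories, scores) if c <= cut]
        n_left, sum_left = len(left), sum(left)
        n_right, sum_right = n - n_left, total - sum_left
        # BSS = sum over groups of n_g * (mean_g - grand_mean)^2,
        # written in terms of group totals.
        bss = sum_left**2 / n_left + sum_right**2 / n_right - total**2 / n
        if bss > best_bss:
            best_cut, best_bss = cut, bss
    return best_cut, best_bss


def best_split_over_variables(predictors, scores):
    """Scan every predictor and return (name, cut, bss) with the largest BSS."""
    best = (None, None, 0.0)
    for name, categories in predictors.items():
        cut, bss = best_binary_split(categories, scores)
        if cut is not None and bss > best[2]:
            best = (name, cut, bss)
    return best


# Hypothetical toy data: two predictors, six individuals.
toy_predictors = {
    "var_a": [1, 1, 2, 2, 3, 3],
    "var_b": [1, 2, 1, 2, 1, 2],
}
toy_scores = [10, 12, 11, 13, 20, 22]

name, cut, bss = best_split_over_variables(toy_predictors, toy_scores)
print(name, cut, round(bss, 4))
```

In the toy data the largest BSS is found on `var_a` between categories 2 and 3, so the first split would separate categories 1 and 2 from category 3.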

The process continues until a specified stop-criterion is reached. Each 
time a split occurs, the resulting j mutually exclusive groups represent 
the full model, and the minimized error sum of squares of this model is 
compared with the error sum of squares of the previous model, consisting 
of (j-1) mutually exclusive groups. The final split represents an optimal 
full model which could have been hypothesized before starting to impose 
restrictions. Going from the final model with the last split towards 
the original unsplit group, each unsplit group represents an additional
restriction. 
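
The repeated splitting can be sketched recursively. The stop-criterion is not specified in the text, so the minimum group size and the minimum BSS (as a fraction of the total sum of squares) used below are assumptions, as are all function names and the toy data. The sketch also splits each group depth-first, which is a simplification: the order in which groups are examined changes the sequence of the splits but not the final groups under these stopping rules.

```python
def sse(ys):
    """Within-group (error) sum of squares around the group mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def find_split(rows, ys, min_size):
    """Return (bss, variable index, cut) with the largest BSS, or None."""
    best = None
    parent_sse = sse(ys)
    for j in range(len(rows[0])):
        for cut in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, ys) if r[j] <= cut]
            right = [y for r, y in zip(rows, ys) if r[j] > cut]
            if min(len(left), len(right)) < min_size:
                continue
            # BSS of a split equals the reduction in error sum of squares.
            bss = parent_sse - sse(left) - sse(right)
            if best is None or bss > best[0]:
                best = (bss, j, cut)
    return best


def grow(rows, ys, total_ss, min_size=2, min_gain=0.01):
    """Split recursively; return the final groups as lists of scores."""
    split = find_split(rows, ys, min_size)
    # Assumed stop-criterion: no admissible split, or too small a BSS.
    if split is None or split[0] < min_gain * total_ss:
        return [ys]
    _, j, cut = split
    groups = []
    for keep_left in (True, False):
        part = [(r, y) for r, y in zip(rows, ys) if (r[j] <= cut) == keep_left]
        groups += grow([r for r, _ in part], [y for _, y in part],
                       total_ss, min_size, min_gain)
    return groups


# Hypothetical toy data: (education level 1-3, role 1-2), two people per cell.
toy_rows = [(1, 1), (1, 1), (1, 2), (1, 2), (2, 1), (2, 1),
            (2, 2), (2, 2), (3, 1), (3, 1), (3, 2), (3, 2)]
toy_ys = [10, 12, 20, 22, 14, 16, 24, 26, 30, 32, 40, 42]

toy_total_ss = sse(toy_ys)
final_groups = grow(toy_rows, toy_ys, toy_total_ss)
r_squared = 1 - sum(sse(g) for g in final_groups) / toy_total_ss
print(len(final_groups), round(r_squared, 4))
```

With these toy data the procedure ends with six groups, one per combination of the two predictors, which is the situation described for the example in the text.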

For our example, the AID-4 splitting process is illustrated in
Figure 1. Going down the branches of the tree-pattern, one can identify
the simple and complex interactions of the optimum polynomial multiple
regression equation. We know that we have predictor variables $x^{(1)}$ and
$x^{(2)}$. The first two splits occurred on $x^{(1)}$, $x^{(2)}$ respectively, hence
we have an $x^{(1)} \cdot x^{(2)}$ term. The second branch
from the left is identical to the first, identifying the same $x^{(1)} \cdot x^{(2)}$
term. The third branch from the left split on both $x^{(1)}$ and $x^{(2)}$,
hence we have an $[x^{(1)}]^2 \cdot x^{(2)}$ term.

FIGURE 1. SPLIT DIAGRAM

Thus, the optimal model is:

Model 6     $y = p_0 u + p_1 x^{(1)} + p_2 x^{(2)} + p_3 (x^{(1)} \cdot x^{(2)}) + p_4 [x^{(1)}]^2 + p_5 ([x^{(1)}]^2 \cdot x^{(2)}) + e_6$

which yields, after conventional solution, an $R^2 = .9003$, which is the same
as AID-4 arrived at after the final split. Note that Model 6 does not
contain a term $[x^{(2)}]^2$, which is consistent with the previous findings,
namely that Model 3 and Model 5 were identical (the only difference
being that Model 5 contained $[x^{(2)}]^2$).
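
Why a squared term for the two-level predictor adds nothing can be seen directly. As an illustrative assumption (the numerical coding is not stated in the text), suppose $x^{(2)}$ takes only the two codes $a$ and $b$, and take $u$ to be the unit vector. Every observation then satisfies the first identity below, which rearranges into the second:

```latex
\bigl(x^{(2)} - a\bigr)\bigl(x^{(2)} - b\bigr) = 0
\quad\Longrightarrow\quad
[x^{(2)}]^2 = (a + b)\,x^{(2)} - ab\,u
```

Under that assumption $[x^{(2)}]^2$ is an exact linear combination of $x^{(2)}$ and $u$, both of which are already in the model, so adding it cannot reduce the error sum of squares. This agrees with the identical error sums of squares reported for Model 3 and Model 5.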

The major advantage accruing to the task scientist using AID-4 is 
obtaining the maximum squared composite correlation without the task of 
attempting to identify the various relevant combinations of linear and 
non-linear interaction terms by trial and error necessary in the full
model of the multiple regression technique. AID-4 automatically identifies 
these terms. The means of the final categorical groups are the proper 
weights to be assigned for each of those groups in predicting the criterion 
variable. An additional major advantage is that out of a regression 
analysis with a large number of predictor variables, there may be only 
a small subset of predictor variables which are of significance in the 
prediction system. AID-4 identifies such a subset of predictors 
automatically. Finally, the branching pattern facilitates interpretation 
of the results. In our simple example, it is much more meaningful to
identify Group 6 on Figure 1 as pilots who have advanced academic degrees
and who have a predicted score of 46.40, than in a polynomial regression
equation where one would have to square "educational level" and multiply
it by "pilotness" in order to identify the same group. In
a large prediction system, attempts to identify and include all possible
combinations of interaction terms represent a practical impossibility
without the help of AID-4.

Many additional and useful bits of information are provided by the
output of AID-4, some of which are: (1) at each split, the increased
percent total explained variance ($R^2$) is printed, together with a
statistical test of significance for the difference between the error sum
of squares of the new model and the previous model prior to the split;
(2) the splits occur in a descending order of importance, that is, the
first split identifies that variable which contributes the most to the
explained variance; the second split identifies the second variable or
a subset of the first split as the next most important contributor to
the explained variance; and so on. This hierarchy is very helpful,
especially if after a few splits a reasonably high $R^2$ is obtained, thus
giving the researcher an option of using only a few predictors in the
prediction system; (3) the branching pattern of splits reflects trends
of characteristics specific to the groups split; that is, it can serve
as an "eyeball" pattern analysis. Following the path of each branch of
the split-tree, one can identify major characteristics of the final
groups on which they differ the most in light of the criterion measure;
(4) cross-validation and double cross-validation options, which either
split the original sample into two random samples or take two given
samples, treat each sample separately, and determine an optimal split
pattern for each and the associated $R^2$. Then the split pattern
of Sample 1 is forced upon Sample 2 and vice-versa, computing a squared composite
correlation for these forced splits. The difference between the optimal
$R^2$ for each sample and the corresponding squared composite correlation
obtained by forced splitting is a good indicator of the stability of the
system; (5) selective or "partial" effects of the predictors are
identified such that even if the so-called "main effect" of a particular
variable in a complex analysis of variance results in a non-significant
F-ratio, AID-4 selectively indicates the level on the other variable(s)
at which this non-significant effect becomes significant.
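
The forced-split check described under item (4) can be sketched as follows. The text does not say whether group means are re-estimated in the second sample; this sketch assumes they are, so the squared composite correlation is computed as one minus the within-group sum of squares over the total sum of squares for the forced grouping. The function names, the two split patterns and the toy data are all hypothetical.

```python
from collections import defaultdict


def r_squared(group_labels, scores):
    """Squared composite correlation of a grouping: 1 - within SS / total SS."""
    grand_mean = sum(scores) / len(scores)
    total_ss = sum((y - grand_mean) ** 2 for y in scores)
    by_group = defaultdict(list)
    for label, y in zip(group_labels, scores):
        by_group[label].append(y)
    within_ss = 0.0
    for ys in by_group.values():
        mean = sum(ys) / len(ys)
        within_ss += sum((y - mean) ** 2 for y in ys)
    return 1.0 - within_ss / total_ss


def pattern_r_squared(assign_group, rows, scores):
    """Apply a split pattern (a function row -> group label) to a sample."""
    return r_squared([assign_group(r) for r in rows], scores)


# Hypothetical Sample 2: (education level, role) and criterion scores.
sample2_rows = [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 2)]
sample2_scores = [10, 12, 30, 20, 22, 40]


# Hypothetical split pattern found in Sample 1: education only.
def pattern_from_sample1(row):
    return "low" if row[0] <= 2 else "high"


# Hypothetical optimal pattern for Sample 2 itself: education and role.
def pattern_from_sample2(row):
    return (row[0] <= 2, row[1])


optimal_r2 = pattern_r_squared(pattern_from_sample2, sample2_rows, sample2_scores)
forced_r2 = pattern_r_squared(pattern_from_sample1, sample2_rows, sample2_scores)
shrinkage = optimal_r2 - forced_r2
print(round(optimal_r2, 4), round(forced_r2, 4), round(shrinkage, 4))
```

A small difference between the two values suggests that the split pattern carries over from one sample to the other; a large difference suggests that it does not.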

Copies of the write-up and program (to be loaded on a tape provided 
by the user) can be obtained by written request from Dr. Janos Koplyay, 
Chief, Statistical and Computer Technology Section, AFHRL/PHSM, Lackland 
AFB, Texas 78236. 